Maintenance
Start an R session and RStudio.
Continue your work using the project settings from yesterday.
Load the libraries
Grab the libraries needed for this practical:
set.seed(1234)
if (!require("tidyverse")) install.packages("tidyverse"); library("tidyverse")
if (!require("sf")) install.packages("sf"); library("sf")
if (!require("tmap")) install.packages("tmap"); library("tmap")
if (!require("tmaptools")) install.packages("tmaptools"); library("tmaptools")
if (!require("sjmisc")) install.packages("sjmisc"); library("sjmisc")
if (!require("skimr")) install.packages("skimr"); library("skimr")
Notice the function set.seed. It sets the seed of the random number generator, ensuring we all get the same results in situations where random numbers are used, such as selecting random observations from the data.
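A minimal illustration of what set.seed does (the range 1:100 and the sample size here are arbitrary choices for the demo):

```r
# Resetting the same seed makes random draws reproducible
set.seed(1234)
first_draw <- sample(1:100, 3)

set.seed(1234)
second_draw <- sample(1:100, 3)

identical(first_draw, second_draw)  # TRUE - identical draws
```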
Data
Load data prepared during previous session:
SA2_SEIFA <- readRDS("data/SA2_SEIFA.Rds")
Data: Airbnb in Melbourne
Reading the data
listings <- read_csv("./data/listings.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## host_id = col_double(),
## host_name = col_character(),
## neighbourhood = col_character(),
## latitude = col_double(),
## longitude = col_double(),
## room_type = col_character(),
## price = col_double()
## )
The data are stored as a “comma separated values” (csv) file - a text file where values are separated by commas and new lines. This is a common format for exchanging data, since it is simple and can be read by virtually any analytical software. We can read it into R in several ways - here we choose the read_csv function from the readr package.
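For comparison, the base-R read.csv would also work, though it returns a plain data.frame rather than a tibble and does not print a column specification:

```r
# Base R equivalent of read_csv; column types are guessed silently
listings_base <- read.csv("./data/listings.csv", stringsAsFactors = FALSE)
```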
Examine the data
The first five lines of the file will give us some idea of what data we have:
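The output below can be reproduced, for instance, with slice from dplyr (head(listings, 5) would work equally well):

```r
# Print the first five rows of the tibble
listings %>% slice(1:5)
```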
## # A tibble: 5 x 9
## id name host_id host_name neighbourhood latitude longitude room_type
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 9835 Beau~ 33057 Manju Manningham -37.8 145. Private ~
## 2 10803 Room~ 38901 Lindsay Moreland -37.8 145. Private ~
## 3 12936 St K~ 50121 Frank & ~ Port Phillip -37.9 145. Entire h~
## 4 15246 Larg~ 59786 Eleni Darebin -37.8 145. Private ~
## 5 16760 Melb~ 65090 Colin Port Phillip -37.9 145. Private ~
## # ... with 1 more variable: price <dbl>
More information can be obtained with the skim function from the skimr package:
skim(listings) %>% skimr::kable()
Skim summary statistics

n obs: 22895
n variables: 9

Variable type: character

| variable      | missing | complete | n     | min | max | empty | n_unique |
|---------------|---------|----------|-------|-----|-----|-------|----------|
| host_name     | 3       | 22892    | 22895 | 1   | 35  | 0     | 5817     |
| name          | 3       | 22892    | 22895 | 1   | 188 | 0     | 22448    |
| neighbourhood | 0       | 22895    | 22895 | 4   | 17  | 0     | 30       |
| room_type     | 0       | 22895    | 22895 | 11  | 15  | 0     | 3        |

Variable type: numeric

| variable  | missing | complete | n     | mean    | sd         | p0     | p25     | p50     | p75     | p100    |
|-----------|---------|----------|-------|---------|------------|--------|---------|---------|---------|---------|
| host_id   | 0       | 22895    | 22895 | 7.1e+07 | 6.5e+07    | 9082   | 1.7e+07 | 4.8e+07 | 1.1e+08 | 2.3e+08 |
| id        | 0       | 22895    | 22895 | 1.9e+07 | 8141522.23 | 9835   | 1.3e+07 | 2e+07   | 2.5e+07 | 3.1e+07 |
| latitude  | 0       | 22895    | 22895 | -37.83  | 0.067      | -38.22 | -37.85  | -37.82  | -37.8   | -37.48  |
| longitude | 0       | 22895    | 22895 | 145.01  | 0.13       | 144.48 | 144.96  | 144.98  | 145.01  | 145.84  |
| price     | 0       | 22895    | 22895 | 148     | 210.88     | 0      | 71      | 111     | 165     | 12624   |
What can you learn from this summary? Are we dealing with spatial data?
Turning data spatial
Our dataset contains latitude and longitude fields, but it is not a spatial object. We can turn it into a simple features object with the st_as_sf function:
listings_sf <- st_as_sf(listings, coords = c("longitude", "latitude"),
crs = 4326, remove = FALSE)
We set crs to code 4326, which is used for latitude and longitude coordinates on the World Geodetic System 1984 (WGS84) reference ellipsoid. I also opted for the argument remove = FALSE - use the help to see what it means and what happens when you change it.
Now see what changed in the dataset:
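The output below can be produced by printing the first rows again, for example:

```r
# Print the first five features of the new sf object
listings_sf %>% slice(1:5)
```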
## Simple feature collection with 5 features and 9 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 144.9774 ymin: -37.86453 xmax: 145.0921 ymax: -37.75897
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 5 x 10
## id name host_id host_name neighbourhood latitude longitude room_type
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 9835 Beau~ 33057 Manju Manningham -37.8 145. Private ~
## 2 10803 Room~ 38901 Lindsay Moreland -37.8 145. Private ~
## 3 12936 St K~ 50121 Frank & ~ Port Phillip -37.9 145. Entire h~
## 4 15246 Larg~ 59786 Eleni Darebin -37.8 145. Private ~
## 5 16760 Melb~ 65090 Colin Port Phillip -37.9 145. Private ~
## # ... with 2 more variables: price <dbl>, geometry <POINT [°]>
Quick map
We now have spatial data, so we can plot it on a map. However, the dataset is rather large: plotting all of it might not be very informative and could be slow to interact with. Let’s try a random sample of 1000 listings selected from our data with the function sample_n:
tmap_mode("view")
listings_sf %>%
sample_n(1000) %>%
qtm()
The map already gets busy and we still have 21,895 more locations to plot!
Mapping locations - density
We will reduce the number of points by aggregating them to a grid. Certain spatial operations work better in a projected coordinate system, so we start by transforming our data from the unprojected (a.k.a. geographic) WGS84 system (EPSG: 4326), which uses latitude/longitude to reference locations on the ellipsoid Earth, to the projected GDA_1994_Geoscience_Australia_Lambert system (EPSG: 3112), which uses X and Y coordinates (a.k.a. Easting/Northing) to reference locations on a 2D representation of the Earth:
listings_sf <- st_transform(listings_sf, 3112)
See the difference between the latitude and longitude fields and the geometry now:
listings_sf %>%
select(latitude, longitude, geometry) %>%
slice(1:5)
## Simple feature collection with 5 features and 2 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 970445.7 ymin: -4300461 xmax: 981421.9 ymax: -4288699
## epsg (SRID): 3112
## proj4string: +proj=lcc +lat_1=-18 +lat_2=-36 +lat_0=0 +lon_0=134 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 5 x 3
## latitude longitude geometry
## <dbl> <dbl> <POINT [m]>
## 1 -37.8 145. (981421.9 -4291024)
## 2 -37.8 145. (971651 -4289471)
## 3 -37.9 145. (970445.7 -4299815)
## 4 -37.8 145. (972473.9 -4288699)
## 5 -37.9 145. (971710.3 -4300461)
Creating grid
We will now use the locations of Airbnb listings to create a regular hexagonal grid over the area of Melbourne covered by the data:
grid_hex <- listings_sf %>%
st_bbox() %>%
st_as_sfc() %>%
st_make_grid(n = 50, square = FALSE) %>%
st_sf() %>%
mutate(ID = row_number())
A few steps are needed to achieve that. In short, we extracted the spatial extent (a.k.a. bounding box) of the locations with the st_bbox function, turned it into an sf object, used that to construct a grid of 50 x 50 cells with st_make_grid, and finally added an ID variable derived from row numbers so we can merge data later.
Let’s see how the grid looks on a map:
grid_hex %>%
qtm(fill = NULL)
Spatial join
We can now link points to polygons - in other words, perform a spatial join. This type of join can be used to bring information from one data source to another; in our case it will assign each Airbnb listing to one of our newly created hexagons. We will then aggregate the data and create a new variable that summarises how many Airbnb listings there are in each hexagon:
grid_count <- st_join(listings_sf, grid_hex, join = st_intersects) %>%
st_drop_geometry() %>%
group_by(ID) %>%
summarise(room_count = n())
Check what the data look like using slice.
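For instance:

```r
# First five rows of the aggregated counts per hexagon ID
slice(grid_count, 1:5)
```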
Because these data are not spatial, we cannot map them. Instead, we have to merge them back to the grid by performing a non-spatial join using a “key” variable that identifies the same observations in both datasets. In our case this variable is ID:
grid_hex_count <- left_join(grid_hex, grid_count, by = "ID")
How many Airbnb listings do we have in each hexagon?
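A quick non-spatial check (one option among several) is:

```r
# Distribution of listing counts per hexagon; NA marks hexagons with no listings
summary(grid_hex_count$room_count)
sum(is.na(grid_hex_count$room_count))  # number of empty hexagons
```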
It turns out the majority of hexagons contain no Airbnbs at all and received the value NA. For the remaining part we have a very skewed distribution, i.e. many locations with few listings and a few with many:
grid_hex_count %>%
ggplot(aes(room_count)) +
geom_histogram()

You can get more detailed frequencies by running frq(grid_hex_count$room_count).
Finally, let’s plot a map, removing hexagons with no listings:
grid_hex_count %>%
filter(!is.na(room_count)) %>%
tm_shape() +
tm_polygons(col = "room_count",
style = "fixed",
breaks = c(0, 10, 25, 50, 100, 250, 500, 2500, 5000),
alpha = 0.85, palette = "-RdYlGn", lwd = 0)
Mapping price - density
Exploring attributes
In the next step we will map a variable in our data to a visual encoding on the map and change its representation. We will examine how the price of a listing varies across space.
To get some understanding, let’s first look at the data ignoring space.
A histogram is a good place to start:
listings_sf %>%
ggplot(aes(price)) + geom_histogram()

Again we have very skewed data - the descriptive statistics confirm that:
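The summary below comes from descr in the sjmisc package; the exact call is an assumption, but something like this reproduces it:

```r
# Descriptive statistics for price (drop geometry so descr sees a plain column)
listings_sf %>%
  st_drop_geometry() %>%
  descr(price)
```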
##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md trimmed
## dd numeric dd 22895 0 148 210.88 1.39 111 119.75
## range skew
## 12624 (0-12624) 26.33
We also have to remember that listings can belong to different categories:
listings_sf %>%
ggplot(aes(room_type)) + geom_bar()

To get the percentages we can use the frq function from the sjmisc package:
frq(listings_sf$room_type)
##
## # x <character>
## # total N=22895 valid N=22895 mean=1.39 sd=0.52
##
## val frq raw.prc valid.prc cum.prc
## Entire home/apt 14379 62.80 62.80 62.80
## Private room 8116 35.45 35.45 98.25
## Shared room 400 1.75 1.75 100.00
## <NA> 0 0.00 NA NA
Clearly the price must depend on the listing type. A box plot can help check that:
listings_sf %>%
ggplot(aes(room_type, price)) + geom_boxplot() + scale_y_log10()

Pay attention to the scale_y_log10() option - I transformed the Y axis to a logarithmic scale, which helps us see the differences when dealing with skewed data. What changes when you remove this option?
Quick map
Once again let’s look at a sample of 1000 rooms, but this time use colour to differentiate the prices:
listings_sf %>%
filter(room_type == "Private room") %>%
sample_n(1000) %>%
tm_shape() +
tm_dots(col = "price", style = "quantile", n = 5, palette = "seq")
It is hard to see any patterns here, so we will try to create a different representation of this map.
Data preparation
First, we need to create a boundary of the area for which we want to map the price. We can use our map of SA2 areas. Let’s change its coordinate system to match the listings data:
SA2_SEIFA <- st_transform(SA2_SEIFA, 3112)
Density calculation
The tmaptools package provides a simple 2D kernel density estimator, smooth_map. We will use it to map the density of the price variable (argument var) with a chosen smoothing bandwidth (argument bandwidth), selecting only private rooms from the dataset:
room_price_density <- listings_sf %>%
filter(room_type == "Private room") %>%
smooth_map(cover = SA2_SEIFA,
var = price,
bandwidth = 0.5,
breaks = c(0, 1, 5, 10, 25, 50, 100, 250, 500))
Examine the density object:
names(room_price_density)
## [1] "raster" "iso" "polygons" "bbox" "nrow" "ncol"
## [7] "cell.area" "bandwidth"
It turns out several different representations were created. We can now use them to create maps. First the polygons:
tm_shape(room_price_density$polygons) +
tm_fill(col = "level", palette = "-RdYlGn",
title = "Airbnb room price density")
Now contour lines:
tm_shape(room_price_density$iso) +
tm_iso(col = "level", palette = "-RdYlGn",
title = "Airbnb room price density ")
It is now easier to see that the price clearly depends on the centrality of the listing.
Linking prices and deprivation
We will look at one more possibility of examining how the price varies in space. This time we will use spatial join to bring information about SEIFA indices to each listing.
Spatial overlay
We can link polygons to points in a similar way to how we obtained the counts of listings in each hexagon. Such a “spatial join”, performed with the st_join function, will assign the values of the SEIFA polygons to each listing point by “intersecting” or “overlaying” them. This is possible because both of our datasets are geographical and we transformed them to the same CRS.
listings_sf_SEIFA <- st_join(listings_sf, SA2_SEIFA, join = st_intersects)
Now our listings have the SEIFA indices from the SA2 area they belong to:
slice(listings_sf_SEIFA, 1:5)
## Simple feature collection with 5 features and 22 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 970445.7 ymin: -4300461 xmax: 981421.9 ymax: -4288699
## epsg (SRID): 3112
## proj4string: +proj=lcc +lat_1=-18 +lat_2=-36 +lat_0=0 +lon_0=134 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## # A tibble: 5 x 23
## id name host_id host_name neighbourhood latitude longitude room_type
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 9835 Beau~ 33057 Manju Manningham -37.8 145. Private ~
## 2 10803 Room~ 38901 Lindsay Moreland -37.8 145. Private ~
## 3 12936 St K~ 50121 Frank & ~ Port Phillip -37.9 145. Entire h~
## 4 15246 Larg~ 59786 Eleni Darebin -37.8 145. Private ~
## 5 16760 Melb~ 65090 Colin Port Phillip -37.9 145. Private ~
## # ... with 15 more variables: price <dbl>, SA2_MAIN16 <dbl>,
## # SA2_NAME16 <fct>, SA4_CODE16 <fct>, SA4_NAME16 <fct>, IRSD_s <dbl>,
## # IRSD_d <int>, IRSAD_s <dbl>, IRSAD_d <int>, IER_s <dbl>, IER_d <int>,
## # IEO_s <dbl>, IEO_d <int>, URP <dbl>, geometry <POINT [m]>
Relationship
Having the price and SEIFA data together, we can now examine the relationship between them.
We can use the st_drop_geometry function to get rid of the geographical features of the data and turn it into a normal data frame, which allows us to do some calculations. Here is a tabulation of the decile frequencies:
listings_sf_SEIFA %>%
st_drop_geometry() %>%
filter(room_type == "Private room") %>%
filter(!is.na(IRSAD_d)) %>%
group_by(room_type) %>%
frq(IRSAD_d)
##
## Grouped by:
## room_type: Private room
##
## # IRSAD_d <integer>
## # total N=8100 valid N=8100 mean=7.87 sd=2.01
##
## val frq raw.prc valid.prc cum.prc
## 1 108 1.33 1.33 1.33
## 2 86 1.06 1.06 2.40
## 3 200 2.47 2.47 4.86
## 4 188 2.32 2.32 7.19
## 5 343 4.23 4.23 11.42
## 6 677 8.36 8.36 19.78
## 7 1450 17.90 17.90 37.68
## 8 949 11.72 11.72 49.40
## 9 2377 29.35 29.35 78.74
## 10 1722 21.26 21.26 100.00
## <NA> 0 0.00 NA NA
Clearly more rooms are located in areas with higher scores.
A box plot of price across the deciles of the index can help visualise whether there is a difference in price:
listings_sf_SEIFA %>%
st_drop_geometry() %>%
filter(room_type == "Private room") %>%
filter(!is.na(IRSAD_d)) %>%
ggplot(aes(as.factor(IRSAD_d), price)) +
geom_boxplot() + scale_y_log10()

Using the score instead of the deciles can also be helpful:
listings_sf_SEIFA %>%
st_drop_geometry() %>%
filter(room_type == "Private room") %>%
filter(!is.na(IRSAD_d)) %>%
ggplot(aes(IRSAD_s, price)) +
geom_point(alpha = 0.25) + scale_y_log10() +
geom_smooth()

It seems there is a slight increase in price with the score. It might be worth investigating further and writing a report on it.